Learning Theory
Supervised Learning
Objective
For an unknown \(c\in\mathcal{C}\), given
- \(x^{(1)},\ldots,x^{(m)}\) (“training data”), where each \(x^{(i)}\) is drawn i.i.d. from a distribution \(D\)
- \(c(x^{(1)}),\ldots,c(x^{(m)})\) (“labels”)
find a hypothesis \(h(x)\) such that
- \(h\in\mathcal{C}\)
- \(\forall i\; h(x^{(i)}) = c(x^{(i)})\)
Equivalently, define
\[ \mathrm{Cons}_{\mathcal{C}}(X) \;=\; \lbrace\, h\in\mathcal{C} : h(x^{(i)}) = c(x^{(i)}) \text{ for all } i \,\rbrace \]
and view learning as a relation/search problem: find \(h\in \mathrm{Cons}_{\mathcal{C}}(X)\).
(Topics listed in lecture: Supervised Learning, Occam’s Razor, Elimination Algorithm.)
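To make the search formulation concrete, here is a minimal consistency check in Python (my own sketch, not from the lecture); `h` is assumed to be any callable mapping an example to a 0/1 label.

```python
def is_consistent(h, examples, labels):
    """Return True iff hypothesis h agrees with the target label on every training example."""
    return all(h(x) == y for x, y in zip(examples, labels))

# Usage: examples as bit-vectors, h a candidate conjunction x1 AND x3
examples = [(1, 0, 1), (0, 0, 1), (1, 1, 0)]
labels   = [1, 0, 0]
h = lambda x: int(x[0] == 1 and x[2] == 1)
print(is_consistent(h, examples, labels))  # True: h is in Cons_C(X) for this sample
```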
Example: Databases
Let \(\mathcal{C}\) be “all databases” (i.e., all possible properties).
A naive algorithm:
- If we observe \(c(x^{(i)}) = 1\), add \(x^{(i)}\) to our database hypothesis \(h\).
Question: how does this \(h\) perform on unobserved examples?
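As a quick sketch (my own, not from the lecture): the naive algorithm simply memorizes the positive examples, so it is consistent with the training data but predicts 0 on everything it has never seen.

```python
def memorize(examples, labels):
    """Naive 'database' learner: store every positively labeled example."""
    database = {x for x, y in zip(examples, labels) if y == 1}
    # The returned hypothesis says 1 only for examples it has already seen labeled 1.
    return lambda x: int(x in database)

h = memorize([(1, 0), (0, 1), (1, 1)], [1, 0, 1])
print(h((1, 0)))  # 1 -- seen as a positive example
print(h((0, 0)))  # 0 -- never observed, so the hypothesis defaults to 0
```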
PAC Learning
We say \(\mathcal{C}\) is PAC-learnable if there exists an algorithm such that for every \(c\in\mathcal{C}\) and every distribution \(D\), given parameters \(n,\varepsilon,\delta\) it
- uses \(m = \mathrm{poly}(n, 1/\varepsilon, 1/\delta)\) labeled examples drawn i.i.d. from \(D\), and
- runs in \(\mathrm{poly}(n, 1/\varepsilon, 1/\delta)\) time, and
- with probability at least \(1-\delta\) (“Probably”) returns \(h\in\mathcal{C}\) such that \(\Pr_{x\sim D}[h(x)=c(x)] \ge 1-\varepsilon\) (“Approximately”).
Proof pattern (union bound over “small” hypotheses)
Define \(\hat h\) to be BAD if \(\Pr_{x\sim D}[\hat h(x)\ne c(x)] \ge \varepsilon\).
Key calculation: if \(\hat h\) is BAD, then the probability that it is still consistent with \(m\) i.i.d. samples is at most
\[ (1-\varepsilon)^m \le e^{-\varepsilon m}. \]
If there are at most \(2^{B(n,m)+1}\) “small” candidate hypotheses (bounded by some description-length term \(B(n,m)\)), then by a union bound the probability that any small BAD hypothesis remains consistent is at most
\[ 2^{B(n,m)+1}\,(1-\varepsilon)^m \le 2^{B(n,m)+1}\,e^{-\varepsilon m}. \]
Choosing
- \(k = 1/\ln 2\) (to convert between \(\ln\) and \(\log_2\)), and
- \(m\) large enough so that the above expression is \(\le \delta\)
gives the standard PAC-style sample bound.
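Filling in the algebra behind this choice (my own expansion of the step sketched above): with \(e^{-\varepsilon m}=2^{-k\varepsilon m}\) for \(k=1/\ln 2\), requiring the union-bound expression to be at most \(\delta\) gives
\[ 2^{B(n,m)+1}e^{-\varepsilon m} = 2^{B(n,m)+1-k\varepsilon m} \le \delta \;\iff\; m \ge \frac{1}{k\varepsilon}\Big(B(n,m)+1+\log_2\tfrac{1}{\delta}\Big), \]
which matches the conjunction bound quoted later with \(B(n,m)=2n\).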
Example: Survey / empirical validity
Suppose we survey 20 households chosen at random from the city, asking whether they own or rent and what their income is.
Consider the property “income > $300k or rents,” and suppose all 20 survey responses satisfy it.
Let \(A_i\) be the event that the \(i\)-th surveyed household satisfies the property.
Assuming the responses are independent,
\(P\big(\bigcap_{i=1}^{20} A_i\big)=\prod^{20}_{i=1}P(A_i)\), where each \(P(A_i)=\frac{\#\text{ households satisfying the property}}{\#\text{ total households}}\).
Suppose fewer than 90% of households satisfy the property.
Then this probability would be \(<0.9^{20}\approx0.12\).
Thus, having observed all 20 responses satisfy the property, we have \(\approx 88\%\) confidence that \(\geq 90\%\) of households satisfy it.
More Generally
We will consider domains corresponding to distributions \(D\) over sets of possible worlds. We say \(C\) is \((1-\varepsilon)\)-valid if \(P_D(\text{not } C)\leq \varepsilon\). If \(C\) is not \((1-\varepsilon)\)-valid and we independently draw 20 examples from \(D\), then \(P(\text{all }20\text{ satisfy }C) = P_D(C)^{20} < (1-\varepsilon)^{20}\).
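A small numeric check of this bound (my own snippet; the function name `validity_confidence` is just illustrative):

```python
def validity_confidence(epsilon, num_samples):
    """If C is NOT (1 - epsilon)-valid, then P(all samples satisfy C) < (1 - epsilon)**num_samples.
    Seeing every sample satisfy C therefore gives confidence > 1 - (1 - epsilon)**num_samples
    that C is (1 - epsilon)-valid."""
    return 1 - (1 - epsilon) ** num_samples

# Survey example: epsilon = 0.1 (i.e. "at least 90% satisfy"), 20 households, all satisfying the property.
print(round(0.9 ** 20, 3))                      # 0.122: chance of this outcome if fewer than 90% satisfy
print(round(validity_confidence(0.1, 20), 3))   # 0.878: confidence that >= 90% satisfy
```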
Learning Conjunctions
Elimination Algorithm
Initialize \(h(x)\) to contain all literals (so \(2n\) literals total).
For each example \(i\):
- If \(c(x^{(i)}) = 1\), remove from \(h\) every literal \(\ell\) such that \(\ell(x^{(i)})=0\).
Return \(h(x)\).
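A minimal Python rendering of the Elimination Algorithm (my own sketch; examples are assumed to be 0/1 tuples, and a literal is represented as a pair `(j, v)` meaning \(x_j = v\)):

```python
def eliminate(examples, labels):
    """Elimination algorithm for learning a conjunction over n boolean variables.
    A literal (j, 1) means x_j, and (j, 0) means NOT x_j."""
    n = len(examples[0])
    # Start with all 2n literals: x_1, NOT x_1, ..., x_n, NOT x_n.
    h = {(j, v) for j in range(n) for v in (0, 1)}
    for x, y in zip(examples, labels):
        if y == 1:
            # On a positive example, drop every literal the example falsifies.
            h = {(j, v) for (j, v) in h if x[j] == v}
    return h

def predict(h, x):
    """Evaluate the conjunction h on example x."""
    return int(all(x[j] == v for (j, v) in h))

# Labels generated by the target concept c(x) = x_1 AND NOT x_3.
examples = [(1, 0, 0), (1, 1, 0), (0, 1, 0), (1, 1, 1)]
labels   = [1, 1, 0, 0]
h = eliminate(examples, labels)
print(all(predict(h, x) == y for x, y in zip(examples, labels)))  # True: h is consistent
```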
Theorem
Elimination runs in \(O(mn)\) time and solves \(\mathrm{Cons}_{\text{conj}}\) (when \(\mathcal{C}\) is the class of all conjunctions).
Proof idea. Initially \(h(x)\) contains all \(2n\) literals. If a literal is false on some positive example, it cannot appear in the target conjunction (otherwise \(c\) would output 0 on that example), so it is safe to remove. Conversely, on an example with \(c(x^{(i)})=1\) every literal of \(c\) is true, so literals of \(c\) are never removed. Hence \(h\) always contains all of \(c\)'s literals (so \(h(x)=1\) implies \(c(x)=1\), and \(h\) is correct on negative examples), and by construction every literal kept in \(h\) is true on every positive example, so \(h\) is consistent with the sample.
A corresponding sample bound (as written in lecture) was of the form
\[ m = \frac{1}{k\varepsilon}\Big(2n + \log_2\tfrac{1}{\delta} + 1\Big), \qquad k=\frac{1}{\ln 2}. \]
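Plugging illustrative numbers (not from the lecture) into this bound:

```python
import math

def conjunction_sample_bound(n, epsilon, delta):
    """Sample size from the bound above: m = (1/(k*eps)) * (2n + log2(1/delta) + 1), with k = 1/ln 2."""
    k = 1 / math.log(2)
    return math.ceil((2 * n + math.log2(1 / delta) + 1) / (k * epsilon))

# e.g. n = 100 variables, 10% error, 95% confidence
print(conjunction_sample_bound(100, 0.1, 0.05))  # 1424 examples
```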
Reduction from \(\mathrm{Cons}\) to PAC
Lemma
If there is a PAC-learning algorithm for a class \(\mathcal{C}\), then there is a randomized polynomial-time algorithm for solving the corresponding consistency/search problem \(\mathrm{Cons}_{\mathcal{C}}\) (for appropriately bounded instances).
Proof
Given a consistency instance with examples \(X=\lbrace x^{(1)},\ldots,x^{(m)}\rbrace\), define the distribution \(D\) to be uniform over \(X\), i.e., \(D(x^{(i)})=\tfrac{1}{m}\) for each \(i\).
Run the PAC learner on \((\mathcal{C}, n, \varepsilon=\tfrac{1}{2m}, \delta=\tfrac12)\) using samples drawn i.i.d. from this \(D\).
With probability at least \(1-\delta=\tfrac12\), the learner outputs \(h\) such that
\[ \Pr_{x\sim D}[h(x)\ne c(x)] \le \varepsilon = \tfrac{1}{2m}. \]
But under a uniform distribution over exactly \(m\) points, the error rate is an integer multiple of \(1/m\). If it is at most \(1/(2m) < 1/m\), then it must be \(0\), hence \(h\) agrees with \(c\) on every \(x^{(i)}\), i.e. \(h\in\mathrm{Cons}_{\mathcal{C}}(X)\). (Repeating the run and checking consistency directly boosts the success probability as high as desired.)
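A code sketch of this reduction (the PAC learner `pac_learn` is a hypothetical black box with the sample/parameter interface shown, not something specified in the lecture):

```python
import random

def solve_cons_via_pac(pac_learn, examples, labels, n):
    """Reduce the consistency problem to PAC learning: feed the learner i.i.d. samples
    from the uniform distribution over the given examples, then check consistency."""
    m = len(examples)

    def sample():
        # Uniform distribution D over the m training points, with their known labels.
        i = random.randrange(m)
        return examples[i], labels[i]

    # epsilon = 1/(2m) forces zero error on D; delta = 1/2 succeeds half the time.
    h = pac_learn(sample, n=n, epsilon=1 / (2 * m), delta=0.5)
    if all(h(x) == y for x, y in zip(examples, labels)):
        return h   # h is in Cons_C(X)
    return None    # this run failed; repeat to boost the success probability
```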
Consequence
Reduction from \(\mathrm{Cons}\) to PAC: if \(\mathrm{Cons}_{\mathcal C}\) is NP-hard, then PAC-learning \(\mathcal C\) is at least as hard, so it cannot be done in polynomial time unless NP-hard problems have randomized polynomial-time algorithms.
Can PAC but not Consis?
The reduction is phrased as a discrete-gap argument.
- Run PAC-learning on \((\mathcal C,n,\varepsilon=\tfrac{1}{2m},\delta=\tfrac12)\).
- Given sample set \(X=\lbrace x^{(1)},\ldots,x^{(m)}\rbrace\), define \(D'\) to be the uniform distribution over \(X\), i.e., \(D'(x^{(i)})=\tfrac{1}{m}\).
- Question written: Can \(h(x)\) satisfy PAC-learning but not satisfy \(\mathrm{Consis}^{B(n)}_{\mathcal C}\)?
If there exists an index \(i\) with \(h(x^{(i)})\ne c(x^{(i)})\), then under \(D'\) we have
\[ \Pr_{x\sim D'}[h(x)\ne c(x)] \ge \tfrac{1}{m}, \]
so the agreement probability is
\[ \Pr_{x\sim D'}[h(x)=c(x)] \le 1-\tfrac{1}{m} \]
(i.e., the error is at least \(1/m\)). Therefore, achieving error \(\le \varepsilon=\tfrac{1}{2m}\) forces zero error on \(D'\), hence \(h\) is consistent with all \(m\) examples.